Add node debug tool with tests #38
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: harche. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.

/hold
/hold cancel
/hold for fixing CI issues.
Thanks for working on this @harche!
The code looks good overall; I left a few comments throughout.
I'll test this on an OpenShift cluster tomorrow.
pkg/toolsets/core/nodes.go
Outdated
internalk8s "github.com/containers/kubernetes-mcp-server/pkg/kubernetes"
)

func initNodes(_ internalk8s.Openshift) []api.ServerTool {
Nit: maybe we can drop the internalk8s.Openshift parameter here since we don't need it? For the initXYZ methods there's no requirement that this parameter exists - it seems to be present in only some of them.
-func initNodes(_ internalk8s.Openshift) []api.ServerTool {
+func initNodes() []api.ServerTool {
Addressed, thanks.
pkg/kubernetes/nodes.go
Outdated
//
// When namespace is empty, the configured namespace (or "default" if none) is used. When image is empty the
// default debug image is used. Timeout controls how long we wait for the pod to complete.
func (k *Kubernetes) NodesDebugExec(
Would it be possible to split this function up a little bit? IMO it is getting quite large and is responsible for too much.
Maybe we can create functions that (roughly as sketched below):
- create the debug pod
- poll for debug completion
- retrieve the logs
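A minimal sketch of that decomposition; all helper names, signatures, and return types are hypothetical (helper bodies omitted), not the PR's actual code:

```go
// Hypothetical shape only: NodesDebugExec becomes a thin orchestrator that
// delegates to three focused helpers.
func (k *Kubernetes) NodesDebugExec(ctx context.Context, namespace, node, image string, command []string, timeout time.Duration) (string, error) {
	pod, err := k.createDebugPod(ctx, namespace, node, image, command) // create the debug pod
	if err != nil {
		return "", fmt.Errorf("creating debug pod: %w", err)
	}
	defer k.deleteDebugPod(context.Background(), pod) // best-effort cleanup of the temporary pod

	terminated, err := k.waitForDebugPod(ctx, pod, timeout) // poll for debug completion
	if err != nil {
		return "", fmt.Errorf("waiting for debug pod %s: %w", pod.Name, err)
	}

	logs, err := k.debugPodLogs(ctx, pod) // retrieve the logs
	if err != nil {
		return "", fmt.Errorf("fetching logs for debug pod %s: %w", pod.Name, err)
	}
	if terminated.ExitCode != 0 {
		return logs, fmt.Errorf("command exited with code %d", terminated.ExitCode)
	}
	return logs, nil
}
```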
Addressed, thanks.
pkg/kubernetes/nodes.go
Outdated
// nodeDebugContainerName is the name used for the debug container, matching oc debug defaults.
nodeDebugContainerName = "debug"
// defaultNodeDebugTimeout is the maximum time to wait for the debug pod to finish executing.
defaultNodeDebugTimeout = 5 * time.Minute
We may want to lower this timeout, as by default this will significantly exceed the client tool call connection timeout: https://github.com/modelcontextprotocol/typescript-sdk/blob/e0de0829019a4eab7af29c05f9a7ec13364f121e/src/shared/protocol.ts#L60
We probably also want to add support for progress notifications to be sent to the client for longer-running tool calls like this one (cc @mrunalp @ardaguclu @matzew @manusa)
Reduced to 1 min, thanks.
pkg/kubernetes/nodes.go
Outdated
grace := int64(0)
_ = podsClient.Delete(deleteCtx, created.Name, metav1.DeleteOptions{GracePeriodSeconds: &grace})
nit: let's use ptr.To here like we do elsewhere
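For reference, what the reviewer is pointing at might look like this (a sketch assuming k8s.io/utils/ptr is imported as ptr):

```go
// Using ptr.To removes the need for the intermediate grace variable.
_ = podsClient.Delete(deleteCtx, created.Name, metav1.DeleteOptions{
	GracePeriodSeconds: ptr.To(int64(0)),
})
```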
The refactored file at pkg/ocp/nodes_debug.go now uses ptr.To. Thanks.
defaultNodeDebugTimeout = 5 * time.Minute
)

// NodesDebugExec mimics `oc debug node/<name> -- <command...>` by creating a privileged pod on the target
Since this is OCP specific, would it make sense to group this functionality into some pkg/ocp package?
File pkg/ocp/nodes_debug.go is part of the ocp package.
Hold on, I will move everything this PR adds from pkg/kubernetes to pkg/ocp so future rebasing avoids conflicts.
@harche would it be possible to improve the error messages we provide?
When running this server with Claude Code, Claude was able to call the tool, but it frequently got errors such as:
⏺ ocp-debug - Nodes: Debug Exec (MCP)(node: "ip-10-0-112-253.us-east-2.compute.internal", command: ["systemctl","status","kubelet"])
⎿ Error: command exited with code 1 (Error)
I'm not sure if there is a way to get more info about what went wrong, but with the current lack of error messages it was hard for the agent to figure out what went wrong and how to fix it.
pkg/kubernetes/kubernetes.go
Outdated
 type Kubernetes struct {
-	manager *Manager
+	manager          *Manager
+	podClientFactory func(namespace string) (corev1client.PodInterface, error)
Wouldn't these changes in here cause divergence from upstream? I think in this repository we shouldn't allow any changes touching packages such as kubernetes/, mcp/, etc.
My suggestion is to add this first in upstream and simply use it in downstream. The downstream repository should touch only the ocp/ directory.
Yes, working on it: #38 (comment)
/hold
Force-pushed from c99af16 to 694501c.
/test test
return slices.Concat(
	initEvents(),
	initNamespaces(o),
IMO - let's not add any OpenShift-specific tools to the core toolset. I'm worried this may lead to conflicts in the future if we make any upstream changes.
Instead, let's create a new OpenShift-specific toolset (maybe openshift-core) that will hold all the OpenShift-only tools.
@Cali0707 thanks for that feedback. Created a new pkg/toolsets/openshift/ package with the openshift-core toolset, and also moved nodes_debug_exec from core to openshift-core.
We return whatever exit code and error message we get from executing that command.
@harche: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/hold cancel
if terminated != nil {
	if terminated.ExitCode != 0 {
		errMsg := fmt.Sprintf("command exited with code %d", terminated.ExitCode)
		if terminated.Reason != "" {
			errMsg = fmt.Sprintf("%s (%s)", errMsg, terminated.Reason)
		}
		if terminated.Message != "" {
			errMsg = fmt.Sprintf("%s: %s", errMsg, terminated.Message)
		}
		return logs, errors.New(errMsg)
	}
	return logs, nil
In here, can you include the logs in the error message? If you look at how the eventual result to the client is created, the logs will not be returned in the event where there is an error. This makes it hard to know what went wrong.
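One possible way to do that, sketched against the snippet above (the extra `logs` check is the only addition and is illustrative):

```go
if terminated.ExitCode != 0 {
	errMsg := fmt.Sprintf("command exited with code %d", terminated.ExitCode)
	if terminated.Reason != "" {
		errMsg = fmt.Sprintf("%s (%s)", errMsg, terminated.Reason)
	}
	if terminated.Message != "" {
		errMsg = fmt.Sprintf("%s: %s", errMsg, terminated.Message)
	}
	// Append the captured logs so the client sees them even when the command fails.
	if logs != "" {
		errMsg = fmt.Sprintf("%s\nlogs:\n%s", errMsg, logs)
	}
	return logs, errors.New(errMsg)
}
```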
select {
case <-pollCtx.Done():
	return nil, nil, "", fmt.Errorf("timed out waiting for debug pod %s to complete: %w", podName, pollCtx.Err())
default:
}
Let's remove this select, as there is one at the bottom of the loop that is checking the same condition.
select {
case <-pollCtx.Done():
	return nil, nil, "", fmt.Errorf("timed out waiting for debug pod %s to complete: %w", podName, pollCtx.Err())
case <-ticker.C:
}
Could you add a comment here explaining that this is to wait before the next check? This confused me when I first ran into it at the bottom of the loop.
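For instance, the suggested comment could read something like this (the wording is just a suggestion; the select itself is unchanged from the snippet above):

```go
// Wait for the next poll tick before re-checking the debug container's state,
// or give up if the overall timeout expires first.
select {
case <-pollCtx.Done():
	return nil, nil, "", fmt.Errorf("timed out waiting for debug pod %s to complete: %w", podName, pollCtx.Err())
case <-ticker.C:
}
```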
import _ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/config"
import _ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/core"
import _ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/helm"
import _ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/openshift"
Could we move this to an openshift_modules.go to avoid merge conflicts with upstream if we introduce more toolsets there?
This is a good point. Keeping it separate will make it easier to catch up with upstream on this branch.
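A sketch of what such a file might contain (the file name comes from the suggestion above; the package name is an assumption based on wherever the existing blank imports live):

```go
// openshift_modules.go - downstream-only toolset registrations, kept in a
// separate file so upstream changes to the main modules file rebase cleanly.
package main // assumed to match the package of the existing blank-import file

import _ "github.com/containers/kubernetes-mcp-server/pkg/toolsets/openshift"
```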
 return &StaticConfig{
 	ListOutput: "table",
-	Toolsets: []string{"core", "config", "helm"},
+	Toolsets: []string{"core", "config", "helm", "openshift-core"},
I'm not sure we want to add this to the default config here.
On the one hand, it makes sense to have the OpenShift tools enabled by default in the OpenShift fork.
However, this introduces lots of changes to the other config/toolset tests, as now they all expect the openshift-core toolset to be there.
I'm worried this will lead to a load of conflicts whenever we make changes to those upstream...
Any thoughts @matzew @manusa @mrunalp @ardaguclu?
I guess I am fine w/ just running with explicit config for these extra tools, since it is a fork?
Yeah - my only concern is that a lot of the tests rely on this default config - see the large diff on the tests.
This may cause a lot of diffs if we make changes to those upstream.
- **nodes_debug_exec** - Run commands on an OpenShift node using a privileged debug pod with comprehensive troubleshooting utilities. The debug pod uses the UBI9 toolbox image which includes: systemd tools (systemctl, journalctl), networking tools (ss, ip, ping, traceroute, nmap), process tools (ps, top, lsof, strace), file system tools (find, tar, rsync), and debugging tools (gdb). Commands execute in a chroot of the host filesystem, providing full access to node-level diagnostics. Output is truncated to the most recent 100 lines, so prefer filters like grep when expecting large logs.
  - `command` (`array`) **(required)** - Command to execute on the node via chroot. All standard debugging utilities are available including systemctl, journalctl, ss, ip, ping, traceroute, nmap, ps, top, lsof, strace, find, tar, rsync, gdb, and more. Provide each argument as a separate array item (e.g. ['systemctl', 'status', 'kubelet'] or ['journalctl', '-u', 'kubelet', '--since', '1 hour ago']).
  - `image` (`string`) - Container image to use for the debug pod (optional). Defaults to registry.access.redhat.com/ubi9/toolbox:latest which provides comprehensive debugging and troubleshooting utilities.
  - `namespace` (`string`) - Namespace to create the temporary debug pod in (optional, defaults to the current namespace or 'default').
  - `node` (`string`) **(required)** - Name of the node to debug (e.g. worker-0).
  - `timeout_seconds` (`integer`) - Maximum time to wait for the command to complete before timing out (optional, defaults to 300 seconds).
Can you re-generate this? This tool isn't in the core toolset anymore 😄
- **projects_list** - List all the OpenShift projects in the current cluster

- **nodes_debug_exec** - Run commands on an OpenShift node using a privileged debug pod with comprehensive troubleshooting utilities. The debug pod uses the UBI9 toolbox image which includes: systemd tools (systemctl, journalctl), networking tools (ss, ip, ping, traceroute, nmap), process tools (ps, top, lsof, strace), file system tools (find, tar, rsync), and debugging tools (gdb). Commands execute in a chroot of the host filesystem, providing full access to node-level diagnostics. Output is truncated to the most recent 100 lines, so prefer filters like grep when expecting large logs.
Eventually we might want to group the tools into "kube" and "ocp" tools?
(Not relevant now, but generally wondering. CC @manusa)
Tested with claude code,